Data Cleaning in SQL

Continuing my SQL projects, I will be performing data cleaning in a more advanced way. The data set that I will be cleaning have the information about Nashville Housing, that contains information about the Property Address, Sales Date, Sales Price, Owner Name and much more.

Importing Libraries

Establishing Connection

In order to use SQL databases, you must establish connection with Microsoft SQL Server. One way to do that, is using “dbConnect”:

conn2 <- dbConnect(odbc:: odbc(),"SQLServer_DSN" )

Cleaning Data

First View

Let’s take a first look into our data:

SELECT * 
FROM "Portfolio"."dbo"."nashville"

Changing Data Format

ALTER TABLE Portfolio.dbo.nashville 
ADD SaleDateConverted Date; 

UPDATE Portfolio.dbo.nashville 
SET SaleDateConverted = CONVERT(Date,SaleDate)
SELECT SaleDateConverted, SaleDate  
FROM Portfolio.dbo.nashville

Populate Property Address data

First let’s take a look if there is missing values on the PropertyAddress column.

SELECT PropertyAddress
FROM "Portfolio"."dbo"."nashville"
WHERE PropertyAddress IS NULL
Table 1: 0 records
PropertyAddress
SELECT ParcelID, PropertyAddress
FROM "Portfolio"."dbo"."nashville"
ORDER BY ParcelID
Select a.ParcelID, a.PropertyAddress, b.ParcelID, b.PropertyAddress, ISNULL(a.PropertyAddress,b.PropertyAddress) AS NewPropertyAddress
From "Portfolio"."dbo"."nashville" a
JOIN "Portfolio"."dbo"."nashville" b
    on a.ParcelID = b.ParcelID
    AND a.[UniqueID ] <> b.[UniqueID ]
Where a.PropertyAddress is null
Update a
SET PropertyAddress = ISNULL(a.PropertyAddress,b.PropertyAddress)
From "Portfolio"."dbo"."nashville" a
JOIN "Portfolio"."dbo"."nashville" b
    on a.ParcelID = b.ParcelID
    AND a.[UniqueID ] <> b.[UniqueID ]
Where a.PropertyAddress is null
Select a.ParcelID, a.PropertyAddress, b.ParcelID, b.PropertyAddress, ISNULL(a.PropertyAddress,b.PropertyAddress) AS NewPropertyAddress
From "Portfolio"."dbo"."nashville" a
JOIN "Portfolio"."dbo"."nashville" b
    on a.ParcelID = b.ParcelID
    AND a.[UniqueID ] <> b.[UniqueID ]
Where a.PropertyAddress is null
Table 2: 0 records
ParcelID PropertyAddress ParcelID PropertyAddress NewPropertyAddress

Breaking out Address into Individual Columns (Address, City, State)

In our column, PropertyAddress, we can see that there a comma, separating the address from the city. Let’s create a new column to get the name of the cities.

ALTER TABLE Portfolio.dbo.nashville
ADD PropertySplitAddress Nvarchar(255);

Update Portfolio.dbo.nashville
SET PropertySplitAddress = SUBSTRING(PropertyAddress, 1, CHARINDEX(',', PropertyAddress) -1 )
ALTER TABLE Portfolio.dbo.nashville
ADD PropertySplitCity Nvarchar(255);

Update Portfolio.dbo.nashville
SET PropertySplitCity = SUBSTRING(PropertyAddress, CHARINDEX(',', PropertyAddress) + 1 , LEN(PropertyAddress))
Select PropertySplitAddress, PropertySplitCity
From Portfolio.dbo.nashville

We also can do something very similar to get information on Address, City and State for the OwnerAddress column. But first, let’s look on this column:

SELECT OwnerAddress
FROM "Portfolio"."dbo"."nashville"
Table 3: Displaying records 1 - 10
OwnerAddress
1808 FOX CHASE DR, GOODLETTSVILLE, TN
1832 FOX CHASE DR, GOODLETTSVILLE, TN
1864 FOX CHASE DR, GOODLETTSVILLE, TN
1853 FOX CHASE DR, GOODLETTSVILLE, TN
1829 FOX CHASE DR, GOODLETTSVILLE, TN
1821 FOX CHASE DR, GOODLETTSVILLE, TN
2005 SADIE LN, GOODLETTSVILLE, TN
1917 GRACELAND DR, GOODLETTSVILLE, TN
1428 SPRINGFIELD HWY, GOODLETTSVILLE, TN
1420 SPRINGFIELD HWY, GOODLETTSVILLE, TN
-- Address
ALTER TABLE Portfolio.dbo.nashville
Add OwnerSplitAddress Nvarchar(255);

Update Portfolio.dbo.nashville
SET OwnerSplitAddress = PARSENAME(REPLACE(OwnerAddress, ',', '.') , 3)
-- City
ALTER TABLE Portfolio.dbo.nashville
Add OwnerSplitCity Nvarchar(255);

Update Portfolio.dbo.nashville
SET OwnerSplitCity = PARSENAME(REPLACE(OwnerAddress, ',', '.') , 2)
-- State
ALTER TABLE Portfolio.dbo.nashville
Add OwnerSplitState Nvarchar(255);

Update Portfolio.dbo.nashville
SET OwnerSplitState = PARSENAME(REPLACE(OwnerAddress, ',', '.') , 1)
SELECT OwnerSplitAddress, OwnerSplitCity, OwnerSplitState
FROM Portfolio.dbo.Nashville

Change Y and N to Yes and No in “Sold as Vacant” field

Let’s change the values on SoldAsVacant column: Y -> Yes and N -> No. 

SELECT DISTINCT(SoldAsVacant), Count(SoldAsVacant)
FROM "Portfolio"."dbo"."nashville"
Group by SoldAsVacant
order by 2
Table 4: 2 records
SoldAsVacant
Yes 4669
No 51704
Update Portfolio.dbo.nashville
SET SoldAsVacant = CASE When SoldAsVacant = 'Y' THEN 'Yes'
       When SoldAsVacant = 'N' THEN 'No'
       ELSE SoldAsVacant
       END
SELECT DISTINCT(SoldAsVacant), Count(SoldAsVacant)
FROM "Portfolio"."dbo"."nashville"
Group by SoldAsVacant
order by 2
Table 5: 2 records
SoldAsVacant
Yes 4669
No 51704

Remove Duplicates

Let’s find out all the duplicate values in our data set:

WITH RowNumCTE AS(
Select *,
    ROW_NUMBER() OVER (
    PARTITION BY ParcelID,
                 PropertyAddress,
                 SalePrice,
                 SaleDate,
                 LegalReference
                 ORDER BY
                    UniqueID
                    ) row_num

FROM "Portfolio"."dbo"."nashville"
)

Select *
From RowNumCTE
Where row_num > 1
Order by PropertyAddress

And delete them:

WITH RowNumCTE AS(
Select *,
    ROW_NUMBER() OVER (
    PARTITION BY ParcelID,
                 PropertyAddress,
                 SalePrice,
                 SaleDate,
                 LegalReference
                 ORDER BY
                    UniqueID
                    ) row_num

FROM "Portfolio"."dbo"."nashville"
)

DELETE
From RowNumCTE
Where row_num > 1
WITH RowNumCTE AS(
Select *,
    ROW_NUMBER() OVER (
    PARTITION BY ParcelID,
                 PropertyAddress,
                 SalePrice,
                 SaleDate,
                 LegalReference
                 ORDER BY
                    UniqueID
                    ) row_num

FROM "Portfolio"."dbo"."nashville"
)

Select *
From RowNumCTE
Where row_num > 1
Order by PropertyAddress

Delete Unused Colunms

ALTER TABLE Portfolio.dbo.nashville
DROP COLUMN OwnerAddress, TaxDistrict, PropertyAddress, SaleDate;

Thank you!